Spark-BDD: Debugging Big Data Applications
Authors
Abstract
Apache Spark has become a key platform for Big Data analytics, yet it lacks complete support for debugging analytics programs. As a result, developing a new analytics application can be a painstakingly long process [7, 2, 4]. To fill this gap, we are developing Spark-BDD (Big Data Debugger), which brings a traditional interactive debugger experience to the Spark platform. Analytics programmers (e.g., data scientists) can leverage Spark-BDD's interactive debugging capabilities to set breakpoints and watchpoints, trace forward and backward through a program execution, perform function hot-swapping at runtime, and use many other features that we believe will greatly shorten the overall development cycle.

We surveyed several toolkits for debugging large-scale distributed data processing programs [3, 5, 6, 1]. Slide 1 summarizes the debugging features that are currently supported or proposed in each toolkit. Inspector Gadget (IG) [6], Newt [5], and RAMP [3] are built on Hadoop, while Arthur [1] targets Spark. Interactively setting breakpoints and watchpoints is a standard debugger feature in modern integrated development environments such as the Eclipse IDE or Visual Studio, yet such runtime inspection is not well supported on large-scale data processing platforms. The reason is that the surveyed toolkits operate only in an offline mode, and therefore support only post-mortem debugging analysis. Moreover, the Hadoop-based toolkits do not provide an interactive query interface. Our work therefore focuses on bringing interactive (online) debugger capabilities to the Spark platform.

Spark-BDD provides runtime inspection of Spark's distributed dataflow computation, enabling many debugging features essential to modern development environments, e.g., breakpoints, watchpoints, and step-through debugging. Furthermore, Spark-BDD provides profiling and performance-monitoring primitives, such as latency alerts, which aid in identifying stragglers. Spark-BDD also provides a new feature called "function hot-swapping," which allows users to replace transformation functions at runtime with alternative logic. This feature effectively supports iterative trial-and-error debugging, where users modify existing transformation functions to see whether the change removes errors or crashes. Among the surveyed toolkits, Arthur [1] is the only one that targets Spark; however, it provides no interactive interface for setting breakpoints and watchpoints or for performing "what-if" analysis by swapping a transformation function at runtime.

Spark-BDD aims to support all features in Slide 1 through three key mechanisms: i) data lineage information, ii) incremental dataflow computation, and iii) runtime-level profiling (similar to IG [6]). Data lineage is captured within the Spark runtime and is surfaced to the programmer as a Spark RDD, which can then be queried with ordinary Spark operations, augmented with debugger-specific functionality, e.g., tracing forward or backward from specific data values in the RDD. More specifically, features 1–6 and 12–14 build, directly or indirectly, on the captured data lineage. Incremental dataflow computation, which we are adding to Spark, enables replay and function hot-swapping, allowing programmers to change program logic (perhaps in response to an exception) and replay the execution from that point efficiently, i.e., by recomputing only what differs from the prior result.
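To make the watchpoint and lineage discussion concrete, the sketch below shows the closest approximations a programmer can hand-build in stock Spark today (a minimal sketch, assuming Spark 2.x; the class name, paths, and thresholds are illustrative, not the Spark-BDD API): a "watchpoint" via a collection accumulator, and operator-level lineage via RDD.toDebugString. Record-level backward tracing, interactive breakpoints, and runtime hot-swapping have no stock equivalent, which is precisely the gap Spark-BDD targets.

```scala
// A minimal sketch (assuming Spark 2.x) of stock-Spark approximations to
// the features above; names and paths are illustrative, not Spark-BDD API.
import org.apache.spark.sql.SparkSession

object StockSparkDebugging {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("bdd-contrast").getOrCreate()
    val sc = spark.sparkContext

    // Hand-rolled "watchpoint": capture records matching a predicate in a
    // side channel, without pausing the job. (Note: accumulator updates
    // inside transformations are not exactly-once under task retries.)
    val suspicious = sc.collectionAccumulator[String]("suspicious-records")

    val counts = sc.textFile(args(0))               // input path (argument)
      .flatMap(_.split("\\s+"))
      .map { word =>
        if (word.length > 100) suspicious.add(word) // capture, don't crash
        (word, 1)
      }
      .reduceByKey(_ + _)

    counts.saveAsTextFile(args(1))                  // output path (argument)

    // Stock Spark exposes only operator-level lineage: toDebugString prints
    // the RDD graph, not which input records produced a given output record,
    // which is the record-level lineage Spark-BDD captures and surfaces.
    println(counts.toDebugString)
    println(s"captured ${suspicious.value.size} suspicious records")

    spark.stop()
  }
}
```

The contrast motivates surfacing lineage as a queryable RDD: the accumulator can report that suspicious records exist, but not which inputs produced a given bad output, and changing the predicate or the map function means recompiling and rerunning the whole job.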
Furthermore, we are instrumenting Spark to capture profiling information, such as record processing times, to support features 7 and 8. We envision that such profiling will aid in tuning shuffle steps, e.g., finding the right number of shuffle partitions, one that minimizes skew without exceeding (or wasting) task resources (e.g., RAM). Our presentation will consist of i) the technical details behind our instrumentation of Spark, and the major technical challenges in minimizing its performance impact; ii) a set of concrete use-case applications, the major problems (i.e., bugs) developers faced, and how our debugger helped solve them; and iii) a live demo of Spark-BDD showcasing its features. We hope to receive feedback and novel use-cases from the HPTS community to guide our further efforts.
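As a sketch of the kind of profiling signal described here, the helper below times each post-shuffle partition and counts its records, which is enough to spot stragglers and skew when choosing a partition count. This is a hand-written, user-code stand-in (with illustrative names) for what Spark-BDD would gather via runtime-level instrumentation.

```scala
// A minimal hand-written stand-in for the per-task profiling discussed
// above; Spark-BDD would collect this inside the runtime, not in user code.
import org.apache.spark.rdd.RDD

object ShuffleSkewProbe {
  /** Log per-partition record counts and processing times after a shuffle;
    * heavily imbalanced partitions indicate skew or a bad partition count. */
  def profile[K, V](shuffled: RDD[(K, V)]): Unit = {
    val stats = shuffled.mapPartitionsWithIndex { (pid, iter) =>
      val start = System.nanoTime()
      var n = 0L
      while (iter.hasNext) { iter.next(); n += 1 }          // drain and count
      Iterator((pid, n, (System.nanoTime() - start) / 1e6)) // elapsed ms
    }.collect()

    stats.sortBy(-_._2).foreach { case (pid, n, ms) =>
      println(f"partition $pid%4d: $n%10d records in $ms%9.1f ms")
    }
  }
}
```

For example, one might call `ShuffleSkewProbe.profile(words.map(w => (w, 1)).reduceByKey(_ + _, 64))` and re-run with different partition counts until the per-partition record counts even out; note that the probe materializes the RDD once, so it is a diagnostic pass rather than a free measurement.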
Similar Resources
Interactive Debugging for Big Data Analytics
An abundance of data in many disciplines has accelerated the adoption of distributed technologies such as Hadoop and Spark, which provide simple programming semantics and an active ecosystem. However, the current cloud computing model lacks the kinds of expressive and interactive debugging features found in traditional desktop computing. We seek to address these challenges with the development ...
Trends and Challenges in Big Data Processing
Almost six years ago we started the Spark project at UC Berkeley. Spark is a cluster computing engine that is optimized for in-memory processing, and unifies support for a variety of workloads, including batch, interactive querying, streaming, and iterative computations. Spark is now the most active big data project in the open source community, and is already being used by over one thousand org...
HDM: A Composable Framework for Big Data Processing
Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. However, the jobs in these frameworks are roughly defined and packaged as executable jars without any functionality being exposed or described. This means that deployed jobs are not natively composable and reusable for subsequent development. Beside...
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark-based data analytics in a scale-up configuration is not...
Titian: Data Provenance Support in Spark
Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data ...